{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Transformer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "\n",
    "This tutorial is available as an IPython notebook at [Malaya/example/topic-modeling-transformer](https://github.com/huseinzol05/Malaya/tree/master/example/topic-modeling-transformer).\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n",
      "  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n",
      "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n",
      "  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import malaya"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv('tests/02032018.csv',sep=';')\n",
    "df = df.iloc[3:,1:]\n",
    "df.columns = ['text','label']\n",
    "corpus = df.text.tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can get this file https://github.com/huseinzol05/malaya/blob/master/tests/02032018.csv . **This csv already stemmed.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load vectorizer object\n",
    "\n",
    "You can use `TfidfVectorizer`, `CountVectorizer`, or any vectorizer as long `fit_transform` method exists."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from malaya.text.vectorizer import SkipGramCountVectorizer\n",
    "\n",
    "stopwords = malaya.text.function.get_stopwords()\n",
    "vectorizer = SkipGramCountVectorizer(\n",
    "    max_df = 0.95,\n",
    "    min_df = 1,\n",
    "    ngram_range = (1, 3),\n",
    "    stop_words = stopwords,\n",
    "    skip = 2\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load Transformer\n",
    "\n",
    "We can use Transformer model to build topic modeling for corpus we have, the power of attention!\n",
    "\n",
    "```python\n",
    "def attention(\n",
    "    corpus: List[str],\n",
    "    n_topics: int,\n",
    "    vectorizer,\n",
    "    cleaning=simple_textcleaning,\n",
    "    stopwords=get_stopwords,\n",
    "    ngram: Tuple[int, int] = (1, 3),\n",
    "    batch_size: int = 10,\n",
    "):\n",
    "    \"\"\"\n",
    "    Use attention from malaya.transformer model to do topic modelling based on corpus given.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    corpus: list\n",
    "    n_topics: int, (default=10)\n",
    "        size of decomposition column.\n",
    "    vectorizer: object\n",
    "    cleaning: function, (default=malaya.text.function.simple_textcleaning)\n",
    "        function to clean the corpus.\n",
    "    stopwords: List[str], (default=malaya.texts.function.get_stopwords)\n",
    "        A callable that returned a List[str], or a List[str], or a Tuple[str]\n",
    "    ngram: tuple, (default=(1,3))\n",
    "        n-grams size to train a corpus.\n",
    "    batch_size: int, (default=10)\n",
    "        size of strings for each vectorization and attention.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result: malaya.topic_model.transformer.AttentionTopic class\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'mesolitica/roberta-base-bahasa-cased': {'Size (MB)': 443},\n",
       " 'mesolitica/roberta-tiny-bahasa-cased': {'Size (MB)': 66.1},\n",
       " 'mesolitica/bert-base-standard-bahasa-cased': {'Size (MB)': 443},\n",
       " 'mesolitica/bert-tiny-standard-bahasa-cased': {'Size (MB)': 66.1},\n",
       " 'mesolitica/roberta-base-standard-bahasa-cased': {'Size (MB)': 443},\n",
       " 'mesolitica/roberta-tiny-standard-bahasa-cased': {'Size (MB)': 66.1},\n",
       " 'mesolitica/electra-base-generator-bahasa-cased': {'Size (MB)': 140},\n",
       " 'mesolitica/electra-small-generator-bahasa-cased': {'Size (MB)': 19.3}}"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "malaya.transformer.available_huggingface"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`,  it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.\n"
     ]
    }
   ],
   "source": [
    "electra = malaya.transformer.huggingface(model = 'mesolitica/electra-base-generator-bahasa-cased')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "You're using a ElectraTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.\n",
      "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
      "To disable this warning, you can either:\n",
      "\t- Avoid using `tokenizers` before the fork if possible\n",
      "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n"
     ]
    }
   ],
   "source": [
    "attention = malaya.topic_model.transformer.attention(corpus, n_topics = 10, vectorizer = electra)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get topics\n",
    "\n",
    "```python\n",
    "def top_topics(\n",
    "    self, len_topic: int, top_n: int = 10, return_df: bool = True\n",
    "):\n",
    "    \"\"\"\n",
    "    Print important topics based on decomposition.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    len_topic: int\n",
    "        size of topics.\n",
    "    top_n: int, optional (default=10)\n",
    "        top n of each topic.\n",
    "    return_df: bool, optional (default=True)\n",
    "        return as pandas.DataFrame, else JSON.\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>topic 0</th>\n",
       "      <th>topic 1</th>\n",
       "      <th>topic 2</th>\n",
       "      <th>topic 3</th>\n",
       "      <th>topic 4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>pertumbuhan</td>\n",
       "      <td>kenyataan</td>\n",
       "      <td>malaysia</td>\n",
       "      <td>malaysia</td>\n",
       "      <td>kerajaan</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>hutang</td>\n",
       "      <td>kwsp</td>\n",
       "      <td>negara</td>\n",
       "      <td>berita</td>\n",
       "      <td>menteri</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>pendapatan</td>\n",
       "      <td>kerajaan</td>\n",
       "      <td>kerajaan</td>\n",
       "      <td>rakyat</td>\n",
       "      <td>penjelasan</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>harga</td>\n",
       "      <td>dana</td>\n",
       "      <td>pengalaman</td>\n",
       "      <td>memalukan</td>\n",
       "      <td>laporan</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>malaysia</td>\n",
       "      <td>menulis</td>\n",
       "      <td>berkongsi</td>\n",
       "      <td>kapal</td>\n",
       "      <td>malaysia</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>projek</td>\n",
       "      <td>dakwaan</td>\n",
       "      <td>perancangan</td>\n",
       "      <td>wang</td>\n",
       "      <td>perdana</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>kaya</td>\n",
       "      <td>mahkamah</td>\n",
       "      <td>berkongsi pengalaman</td>\n",
       "      <td>berita palsu</td>\n",
       "      <td>pemilihan</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>peningkatan</td>\n",
       "      <td>tindakan</td>\n",
       "      <td>kementerian</td>\n",
       "      <td>palsu</td>\n",
       "      <td>ros</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>kenaikan</td>\n",
       "      <td>pertimbangan</td>\n",
       "      <td>impian</td>\n",
       "      <td>negara</td>\n",
       "      <td>tn</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>penerima</td>\n",
       "      <td>menulis kenyataan</td>\n",
       "      <td>pendekatan</td>\n",
       "      <td>buku</td>\n",
       "      <td>isu</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       topic 0            topic 1               topic 2       topic 3  \\\n",
       "0  pertumbuhan          kenyataan              malaysia      malaysia   \n",
       "1       hutang               kwsp                negara        berita   \n",
       "2   pendapatan           kerajaan              kerajaan        rakyat   \n",
       "3        harga               dana            pengalaman     memalukan   \n",
       "4     malaysia            menulis             berkongsi         kapal   \n",
       "5       projek            dakwaan           perancangan          wang   \n",
       "6         kaya           mahkamah  berkongsi pengalaman  berita palsu   \n",
       "7  peningkatan           tindakan           kementerian         palsu   \n",
       "8     kenaikan       pertimbangan                impian        negara   \n",
       "9     penerima  menulis kenyataan            pendekatan          buku   \n",
       "\n",
       "      topic 4  \n",
       "0    kerajaan  \n",
       "1     menteri  \n",
       "2  penjelasan  \n",
       "3     laporan  \n",
       "4    malaysia  \n",
       "5     perdana  \n",
       "6   pemilihan  \n",
       "7         ros  \n",
       "8          tn  \n",
       "9         isu  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "attention.top_topics(5, top_n = 10, return_df = True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get topics as string\n",
    "\n",
    "```python\n",
    "def get_topics(self, len_topic: int):\n",
    "    \"\"\"\n",
    "    Return important topics based on decomposition.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    len_topic: int\n",
    "        size of topics.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result: List[str]\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0,\n",
       "  'pertumbuhan hutang pendapatan harga malaysia projek kaya peningkatan kenaikan penerima'),\n",
       " (1,\n",
       "  'kenyataan kwsp kerajaan dana menulis dakwaan mahkamah tindakan pertimbangan menulis kenyataan'),\n",
       " (2,\n",
       "  'malaysia negara kerajaan pengalaman berkongsi perancangan berkongsi pengalaman kementerian impian pendekatan'),\n",
       " (3,\n",
       "  'malaysia berita rakyat memalukan kapal wang berita palsu palsu negara buku'),\n",
       " (4,\n",
       "  'kerajaan menteri penjelasan laporan malaysia perdana pemilihan ros tn isu'),\n",
       " (5,\n",
       "  'memudahkan mengundi rakyat sasaran mewujudkan berkembang memudahkan rakyat nilai impak mengundi mewujudkan impak mengundi'),\n",
       " (6,\n",
       "  'teknikal berkembang mdb bincang kerja duit selesaikan lancar berlaku kerajaan duit'),\n",
       " (7,\n",
       "  'bayar keputusan bahasa stres kebenaran selesaikan pekan dipecat selesaikan terdekat ambil'),\n",
       " (8,\n",
       "  'parti umno pas bersatu perlembagaan ros undi keputusan pendaftaran harapan'),\n",
       " (9,\n",
       "  'projek rendah gembira mempercayai kebajikan berjalan menjaga kebajikan rakyat malaysia gembira projek')]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "attention.get_topics(10)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
